June 29, 2016

Today's Overview

  • Describe the population of interest
  • Discuss the problem
  • Review the methodology and evaluation
  • Shiny R Markdown implementation

Probationer Population

  • Mostly male
  • Mostly not murderers (>99%), but dangerous
  • State prisoners who were released early on probation

Murderers by Location in LA County

Where did this project start?

Fig. 1 from Berk's study

n = 30,000

Feature selection

Near Zero Variance

##                 freqRatio percentUnique zeroVar   nzv
## Murder         129.421875    0.01198035   FALSE  TRUE
## Age              1.053819    0.38936145   FALSE FALSE
## White            4.662822    0.01198035   FALSE FALSE
## Male             8.717113    0.01198035   FALSE FALSE
## ZIP              1.273859    3.73187972   FALSE FALSE
## Total_Pop        1.273859    3.54618426   FALSE FALSE
## Black_Pop        1.273859    3.28860669   FALSE FALSE
## Prop_Black       1.273859    3.42638074   FALSE FALSE
## Income           1.273859    3.51623338   FALSE FALSE
## PRIMARY CHARGE   1.296763    2.59374626   FALSE FALSE
## Gang             1.883745    0.01198035   FALSE FALSE
## RegisterSO      50.684211    0.01198035   FALSE  TRUE
## ViolentCase     11.704718    0.01198035   FALSE FALSE
## WeaponCase     104.658228    0.01198035   FALSE  TRUE
## DrugCase       537.516129    0.01198035   FALSE  TRUE
## MH               3.310354    0.01198035   FALSE FALSE
## Zip_Present      7.159335    0.01198035   FALSE FALSE
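
The nzv flags above follow from two numbers per column. As a rough base-R sketch, assuming the table came from caret's nearZeroVar() with saveMetrics = TRUE and its default cutoffs (freqCut = 95/5, uniqueCut = 10): the helper nzv_metrics() is a hypothetical re-implementation for illustration, not the caret source, and the counts of 16,566 and 128 are inferred from the printed Murder row rather than stated in the slides.

```r
# Hypothetical helper: mirrors the default near-zero-variance rule
# (assumed cutoffs: freq_cut = 95/5, unique_cut = 10).
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)   # value frequencies, NAs dropped
  n_unique <- length(counts)
  # Ratio of the most common value to the second most common value
  freq_ratio <- if (n_unique > 1) counts[[1]] / counts[[2]] else 0
  percent_unique <- 100 * n_unique / length(x)
  zero_var <- n_unique <= 1
  data.frame(
    freqRatio = freq_ratio,
    percentUnique = percent_unique,
    zeroVar = zero_var,
    nzv = zero_var || (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

# Illustrative data: class counts inferred from the Murder row (an assumption)
murder <- rep(c(0, 1), times = c(16566, 128))
murder_metrics <- nzv_metrics(murder)
print(murder_metrics)
```

The outcome itself is flagged only because it is rare. The three flagged predictors (RegisterSO, WeaponCase, DrugCase) do not appear in either model formula.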

Model 1

library(randomForest)

# Baseline forest: default class cutoffs and no stratified sampling
fit <- randomForest(Murder ~ Age + White + Male + Total_Pop +
                        Black_Pop + Prop_Black + Income +
                        Zip_Present + Gang + ViolentCase,
                    data = train,
                    importance = TRUE,
                    ntree = 1500)

Model 1 ROC

Model 1 Variable Importance

Model 2

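# Tuned forest: fewer predictors, stratified down-sampling, asymmetric cutoffs
fit2 <- randomForest(Murder ~ Age + Total_Pop + Black_Pop +
                         Prop_Black + Income + Zip_Present +
                         ViolentCase,
                     data = train,
                     importance = TRUE,
                     ntree = 1500,
                     mtry = 2,                             # predictors tried per split
                     cutoff = c(0.65, 0.30),               # lower bar for class "1"
                     sampsize = c("0" = 100, "1" = 34),    # cases drawn per class, per tree
                     strata = as.factor(train$Murder),     # sample within each outcome class
                     keep.inbag = TRUE,                    # record in-bag cases for each tree
                     na.action = na.roughfix)              # impute NAs with median / mode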
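
How cutoff = c(0.65, 0.30) shifts the forecasts can be sketched in base R. randomForest assigns the class with the largest ratio of vote share to cutoff. The helper predict_with_cutoff() and the vote shares below are made up for illustration, and the labels "0" and "1" are assumed to mean non-murderer and murderer.

```r
# Hypothetical helper: pick the class with the largest (vote share / cutoff).
predict_with_cutoff <- function(votes, cutoff) {
  ratios <- sweep(votes, 2, cutoff, "/")        # divide each column by its cutoff
  colnames(votes)[max.col(ratios, ties.method = "first")]
}

# Made-up vote shares for three cases (each row sums to 1)
votes <- matrix(c(0.90, 0.10,
                  0.68, 0.32,
                  0.40, 0.60),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("0", "1")))

default_pred <- predict_with_cutoff(votes, c(0.50, 0.50))
shifted_pred <- predict_with_cutoff(votes, c(0.65, 0.30))

# Vote share for class "1" needed to win under the shifted cutoffs:
# v1 / 0.30 > (1 - v1) / 0.65  is equivalent to  v1 > 0.30 / (0.65 + 0.30)
threshold <- 0.30 / (0.65 + 0.30)

print(data.frame(votes, default_pred, shifted_pred))
print(threshold)
```

Under these cutoffs a case is forecast as class "1" once roughly 32% of the trees vote for it, which accepts more false positives in exchange for fewer false negatives.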

Model 2 ROC

Model 2 Variable Importance

Confusion Matrix

Model 1

##                
## pred            Non-Murderers Murderers
##   Non-Murderers          6602        50
##   Murderers                 3         1

Model 2

##               
## pred           Non-Murderer Murderer
##   Non-Murderer         5651       12
##   Murderer              954       39

Ongoing Evaluation

  • Context, context, context
  • False negatives are to be avoided
  • Comparison to logistic regression and risk assessment tools

Comparison to other models

  • Risk assessment tool has 43% accuracy, 32% false negatives

  • Logistic regression: 99% accuracy, 90% false negatives, low sensitivity
    • Berk's false negative rate was 99.7%

Shiny R Markdown Implementation

2,300 early releases from probation

Algorithms in the news

References

  • Berk, R., Sherman, L., Barnes, G., Kurtz, E., & Ahlman, L. (2009). Forecasting murder within a population of probationers and parolees: a high stakes application of statistical learning. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(1), 191-211.
  • Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Müller, M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12, 77. DOI: 10.1186/1471-2105-12-77.